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Abstract 

We propose a nested Gaussian process (nGP) as a locally adaptive prior for Bayesian nonparametric 
regression. Specified through a set of stochastic differential equations (SDEs), the nGP imposes 
a Gaussian process prior for the function's mth-order derivative. The nesting comes in through 
including a local instantaneous mean function, which is drawn from another Gaussian process 
inducing adaptivity to locally-varying smoothness. We discuss the support of the nGP prior in 
terms of the closure of a reproducing kernel Hilbert space, and consider theoretical properties of 
the posterior. The posterior mean under the nGP prior is shown to be equivalent to the minimizer 
of a nested penalized sum-of-squares involving penalties for both the global and local roughness of 
the function. Using highly-efficient Markov chain Monte Carlo for posterior inference, the proposed 
method performs well in simulation studies compared to several alternatives, and is scalable to 
massive data, illustrated through a proteomics application. 
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1 Introduction 



We consider the nonparametric regression problem 

Y(t) = U(t)+e(t), teT = [hM, 



(1) 



where U : T — > R is an unknown mean regression function to be estimated at T Q = {to, tx, ■ ■ ■ , tj < 
t v }, t = 0, and e = [e(ti), e(t 2 ), • • ■ , e{tj)}' ~ Nj(0, of I) a J-dimensional multivariate normal dis- 
tribution with mean vector and covariance matrix of J. We are particularly interested in allowing 
the smoothness of U to vary locally as a function of t. For example, consider the protein mass 
spectrometry data in panel (a) of Figure [TJ There are clearly regions of t across which the function 
is very smooth and other regions in which there are distinct spikes, with these spikes being quite 
important. An additional challenge is that the data are generated in a high-throughput experiment 
with J = 11,186 observations. Hence, we need a statistical model which allows locally- varying 
smoothness, while also permitting efficient computation even when data are available at a large 
number of locations along the function. 

[Figure 1 about here.] 

A commonly used approach for nonparametric regression is t o pla ce a Gaussian process (GP) 



prior (jNeal . 



1998 



Rasmussen and Williams 



2006 



Shi and Choi 



20111 ) on the unknown U, where 



the GP is usually specified by its mean and covariance function (e.g. squared exponential). The 
posterior distribution of U(T ) can be conveniently obtained as a multivariate Gaussian distribu- 
tion. When carefully-chosen hyperpriors are placed on the parameters in the covariance kernel, 



GP priors have been show n to lead to large support, posterior consistency (IGhosal and Roy 



2006 



Choi and Schervish. 



2007 ) and even ne ar minimax optimal adaptive rates of posterior contraction 



( Van der Vaart and Van Zanten 



2008al ). However, the focus of this literature has been on isotropic 



Gaussian processes, which have a single bandwidth parameter controlling global smoothness, with 
the contraction rate theory assuming the true function has a single smoothness level. There has 
been applied work allowing the smoothness of a multivariate regression surface to vary in different 
directions by u s ing predictor-specific b a ndwidths in a GP with a squared expon ential covariance 



( Savitsky et al. 



2011 



Zou et al. 



2010). iBhattacharya. Pati. and Dunson 



(120111 ) recently showed 



that a carefully-scaled anisotropic GP leads to minimax optimal adaptive rates in anisotropic func- 
tion classes including when the true function depends on a subset of the predictors. However, the 
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focus was on allowing a single smoothness level for each predictor, while our current interest is 
allowing smoothness to vary locally in nonparametric regression in a single predictor. 

There is a rich literature on locally- varying smoothing. One popular approach relies on free knot 
splines, for which various strategies have been proposed to select the n umber of knots and their 



locat i ons, including; st e pwise forward and /or backward knots selection ([Friedman and Silverman . 



1989 



Friedman 



1991 



Luo and Wahba 



2001 ) and Bayesian knots selection ( 



20011) via Gi 



Carlo ((Green 



1997), accurate knots selection scheme ( 



Smith and Kohnl. 



) bs sam pling (IGeorge and McCulloch 



1996 



Denison et al. 



1998 



Zho u and Shen . 



Dimatteo et al. 



19931 ) or reversible jump Markov chain Monte 



19951 ) . Although many of these methods perform well in simulations, such free knot 



approaches tend to be highly computationally demanding making their implementation in massive 
data sets problematic. 

In addition to free knot methods, adaptive penalization approaches have also been proposed. 
An estimate of U is obtained as the minimizer of a penalize d sum o f squa r es including a rough- 



ness penalty with a spatially-varying smoot 



2000 



Pintore et al 



wavelet shrinkage 
(IFan and Gijbelsl . 



2006 



mess parameter rtWahba 



Crainiceanu et 



(IDonoho and Johnstone 



mixture of splines ([Wood et al 



(jWofpert et al. 



al. 



1995 



Ruppert and Carroll . 



20071 ) . Other smoothness adaptive methods include 



1995 ), local polynomial fitting with variable bandwidt h 



1995 ), L-spline (lAbramovich and Steinberg 



1996 



Heckman and Ramsay 



2000|) 



20021 ) and linear combination of kernels with varying bandwidths 



20111 ) . The common theme of these approaches is to reduce the constraint on 



the single smoothness level assumption and to implicitly allow the derivatives of U, a common 
measurement of the smoothness of U, to vary over t. 

In this paper, we instead propose a nested Gaussian process (nGP) prior to explicitly model the 
expectation of the derivative of U as a function of t and to make full Bayesian inference using an 
efficient Markov chain Monte Carlo (MCMC) algorithm scalable to massive data. More formally, 
our nGP prior specifies a GP for t/'s mth-order derivative D m U centered on a local instantaneous 
mean function A : T — > R which is in turn drawn from another GP. Both GPs are defined 
by stoch astic differential e quations (SDEs), related to the method proposed by 



Zhu et al. 



However, 



Zhu et al. 



( 1201 lh . 



( 1201 ll ) centered their process on a parametric model, while we instead center on 



a higher- level GP to allow nonparametric locally- adaptive smo othing. Along with the obser vation 



equation ([I]), SDEs can be reformulated as a state space model (IDurbin and Koopmanl. 



reformulation facilitates the application of simulation smoother (IDurbin and Koopman , 



2001). This 



2002|), an 



efficient MCMC algorithm with O(J) computational complexity which is essential to deal with 
large scale data. We will show that the nGP prior has large support and its posterior distribution 
is asymptotically consistent. In addition, the posterior mean or mode of U under the nGP prior 
can be shown to correspond to the minimizer of a penalized sum of squares with nested penalty 
functions. 

The remainder of the paper is organized as follows. Section [2] defines the nGP prior and discusses 
some of its properties. Section [3] outlines an efficient Markov chain Monte Carlo (MCMC) algorithm 
for posterior computation. Section H] presents simulation studies. The proposed method is applied 
to a mass spectra dataset in Section Finally, Section contains several concluding remarks and 
outlines some future directions. 

2 Nested Gaussian Process Prior 
2.1 Definition and Properties 

The nGP defines a GP prior for the mean regression function U and the local instantaneous mean 
function A through the following SDEs with parameters ay £ R + and a A £ K + : 

D m U(t) = A(t) + <TuW v (t), m £ N > 2 (2) 
D n A(t) = a A W A {t), n £ N > 1 (3) 

where Wu(t) and WA{t) are two independent Gaussian white noise processes with mean function 
E{W v (t)} = E{W A (t)} = and covariance function E{Wu(t)Wu(t')} = E{W A (t)W A (t')} = 5(t-t') 
a delta function. The initial value of U and its derivatives up to order m — 1 at to = are denoted 
as \x = (hq, ■ ■ ■ ,fj,m-i)' ~ N m (0, cr^J). Similarly, the initial values of A and its derivatives till 
order n — 1 at to = are denoted as a = (ao, ai, • • • , ot n -\)' ~ N n (0, cr^J). In addition, we assume 
that /x, a, Wjj(-) and are mutually independent. The definition of nGP naturally induces a 

prior for U with varying smoothness. Indeed, the SDE (J2J) suggests that E{D m U(t) \ A(t)} = A(t). 
Thus, the smoothness of U, measured by D m U, is expected to be centered on a function A varying 
over t. 

We first recall the definition of the reproducing kernel Hilbert space (RKHS) generated by the 
zero-mean Gaussian process W = {W(t) : t £ T} and the results on the support of W , which will 
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be useful to explore the theoretical properties of the nGP prior. Let (Q, A, P) be the probability 
space for W such that for any t x , t 2 ,...,t k G T with k e N , {Wfa), W(t 2 ), . . . , W(t k )}' follow a 
zero-mean multivariate normal distribution with covariance matrix induced through the covariance 
function K, w :TxT-fR, defined by JC w {s,t) = E{W{s)W{t)}. The RKHS Hk w generated by 
W is the completion of the linear space of all functions 

k 

aitCw(si, t), a u . . . , a k e R , s x , . . . , s k e T, k e N , 



i=i 



with the inner product 



i=i j=i / w *=i j=i 



Ml" 



which satisfies the reproducing property /(t) = (f,JCw(t, •))?/ f° r an y / e %Cw : 7" — > 

With the speci fication of the RKHS %k. w , we are able t o define the support of W as the closure of 
H^lLemma 5.1, 



Van der Vaart and Van Zanten 



'08bl ). 



the support of the nGP prior, which is formally stated in Theorem [TJ Its proof requires the results 
of the following lemma. 

Lemma 1. The nested Gaussian process U can be written as U(t) = Uo(t) + Ui(t) + A (t) + Ai(t), 
the summation of mutually independent Gaussian processes with the corresponding mean functions 
E{U (t)} = E{Ui(t)} = E{A (t)} = E{Ai(t)} = and covariance functions 



m—l 



i=0 

Kfjjjs.t) = afjTZfj^s^) = a u J G m (s,u)G m (t,u)du, 

n-l 

^(M) = 0- 2 a n Ao (s,t) = al^, (f>m+i(s)(f>m+i(t), 

8=0 

^AiOM) = °i^Ai( s >*) = a A J G m+n (s,u)G m+n (t,u)d 



respectively, where 4>i(t) = ^ and G m (s,u) = ^ m _\y_ ■ 
The proof is in Appendix |A1 

Theorem 1. The support of nested Gaussian process U is the closure of RKHS TLku = ® 
TLk- ®Wk- ®%k- > the direct sum of RKHSs Hk- > Hk- > Hk- and Hk- with reproducing 

U 1 A A 1 U Ui A Q A 1 

kernels K.jj o (s,t) , Kjy^s.t), JC^ (s,t) and JC Al (s,t) respectively. 
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The proof is in Appendix [A] By Corollary [TJ it is of interest to note that %k : U includes a subspace 



Hk-i which is the RKHS for the polynomial smoothing spline ( jWahbal . Il990l Section 1.5). 



Corollary 1. The support of the Gaussian process U — Uq + U\ as the prior for polynomial smooth- 
ing spline is the closure of RKHS Hr- = H/c- © Hk- with C Hku- 

The proof is in Appendix El Hence, it is obvious that the nGP prior includes GP prior for polynomial 
smoothing spline as a special case when a 2 a — > and a\ — > 0. 

The nGP prior can generate functions U arbitrarily close to any function Uq in the support of 
the prior. From Theorem [1] it is clear that the support is large and hence the sample paths from 
the proposed prior can approximate any function in a broad class. As a stronger property, it is also 
appealing that the posterior distribution concentrate in arbitrarily small neighborhoods of the true 
function Uq which generated the data as the sample size J increases, with this property referred 
to as posterior consistency. More formally, a prior II on achieves posterior consistency at the 
true parameter 6 if for any neighborhoods U e , the posterior distribution II (U e | Yi, Y 2 , . . . , Yj) — > 1 
almost surely under ILj , the true joint distribution of observations {Yj}j =1 . For our case, the 
parameters 6 = (U, a £ ) lie in the product space = H/Cu x M + and have a prior Hg = Uu x IT^, 
for which 11^ is an nGP prior for U and H a£ is a prior distribution for a £ . The L\ neighborhood of 
#o = (U ,a £fi ) is defined as U t = l{U,a e ) : \ \U - U \\i = f Q tu \U(t) - U (t)\dt < e, \a £ - a £fi \ < ej. 
We further specify a couple of regularity conditions given by: 

Assumption 1. tj arises according to an infill design: for each Sj = tj+\ — tj, there exists a 
constant < Cd < 1 such that maxi<j<jr Sj < -^-j. 

Assumption 2. The prior distributions I1 CT 2, iT CT 2 ; and 11^2 satisfy an exponential tail con- 
dition. Specifically, there exist sequences Mj, o~ 2 j, o^ j, a 2 aJ and a 2 AJ such that: (i) U a 2(cr 2 > 
<j) = ^{al > a 2 UtJ ) = e~ c - J , U a ,(a 2 a > a 2 a>J ) = e~ c « J and > a\j) = e~ c - J , 

for some positive constants C^, Cjj, C a and Ca; (H) MjaJ 2 > C g J , for every C g > and aj 2 , the 
minimal element of {<t~ j, j, cr~ j, o~ A 2 j}. 

Assumption 3. The prior distribution H ae is continuous and the cr E ,o lies in the support of II CTe 

Under those specifications and regularity conditions, the results on strong posterior consistency for 
the Bayes nonparametric regression with nGP prior is given as follows. 
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Theorem 2. Let {Yj} J - =1 be the independent but non-identical observations following normal dis- 
tributions {Ni(U(tj),a^)}j =1 with unknown mean function U and unknown er| at design points 
ti, t2, . . . tj. Suppose U follows an nGP prior and the Assumptions [Tpl and\^ hold. Then for every 
9q G and every e > 0, 



n (U e | Y u Y 2 , . . . ,Yj) -> 1 a.s. under IT 



e - 



The proof is based on the strong consistency theorem by 
in Appendix [A1 



Choi and Schervishl (120071 ) and is detailed 



2.2 Connection to Nested Smoothing Spline 

We show in Theorem H] that the posterior mean of U under an nGP prior can be related to the 
minimizer, namely the nested smoothing spline (nSS) U, of the following penalized sum-of-squares 
with nested penalties, 




where Xjj G M + and Xa G M + are the smoothing parameters which control the smoothness of 
unknown functions U(t) and A(t) respectively. The following Theorem [3] and Corollary [2] provide 
the explicit forms for nSS, for which the proofs are included in Appendix lAl 

Theorem 3. The nested smoothing spline U(t) has the form 

m—l J n—l J 

u(t) = mMt) + E {t^ t) + j2 (t) + J2PjK Al (tj,t) 

i=0 j=l i=0 j=l 

= fji'^t) + u'R^t) + oc'(j> a {t) + (3'R A (t), 

where /x = (yLt , • ■ ■ ,A*m-i)'> v = • ■ ■ , Vj)' , at = {a ,a u ■ ■ ■ ,a n _i)' and (3 = (/3i,/3 2 , ■ • ■ 

are the coefficients for the bases 

<f>»(t) = {Mt),Mt),--- ,</>m-i{t)y, R&(t) = {n^^^n^t), - ■ ■ ,n 0i (tj,t)}', 

<t>a(t) = {0m(*),0m+l (J),"" , 0m+n-l(*)}', R A (t) = {K Al (t 1 ,t),K Al (t 2 ,t),---,n Al (tj,t)}'. 
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In addition, the nested penalized sum-of-squares can be written as 

nPSS{t) =i (Y - <t>^ - R u - <p a ci - R A (3)' (Y - (j>„n - R u - <p a cx - R A p) 
+ Xuv'RfjV + X A f3'R A (3, 

where 

Y = {Y(t 1 ),Y{t 1 ) r -- ,Y(tj)}', 

^ = {0 M (tO, M (t 2 ), • • ■ , ^(tj)}', <(> a = {4> a (h), <j) a {t 2 ), ■ ■ • , <t> a (tj)}', 

Corollary 2. T7ie coefficients [i, v, a and (3 of the nested smoothing spline U(t) in Theorem^ 
are given as 

/* = s;i 1 Ai«. flf " ly '' 

w/iere 0^ = 0^ — S^S^^, Q | M = <fJ a — £^£^0^ S M | a = S MM — S^S^S^, S Q ^ = 
S aa - S^S^S^, S w = 0j 1 S~ 1 /i , = 0j i( S , ~ 1 0a> S QAt = / Q< S _1 At , S aQ = (f)' a S~ 1 (j) a and 
S = M + %R A = R + JXuI + 

Corollary 3. Lei B M = E^^" 1 , = S" 1 {/ - (0^0^ + 0*2^0^) S -1 }, £ Q = 
S~|^ t Q | M S'^ 1 and Bp = j^B u . The nested smoothing spline U(t) is a linear smoother, expressed in 
the matrix form as, U = K\ V> \ A Y, where K\ Vt x A = (fr^B^ + RqB u + 4> a B a + R A Bp. 

The proof is straightforward by applying Theorem [3] and Corollary [2j As a linear smoother, nSS 
estimates the mean function by a linear combination of observations with the weight matrix Kx v ,\ A - 
Theorem H] below shows the main result of this subsection, i.e. the posterior mean of U under 
the nGP prior is equivalent to the nSS U when a 2 — > oo and cr^, — > oo. The proof is in Appendix 
IA1 and is based on the following results of Lemma [21 
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Lemma 2. For the observations Y = {Y(t 1 ),Y(t2), ■ ■ ■ ,Y(tj)}' and the nested Gaussian process 
U(t), we have 

E{U(t)} = 0, 
E{Y} = 0, 

Cov{U(t), Y} = oj^(0^, + oi&zit) + <J 2 a ct>' a (t)<f>' a + a 2 A R' A (t), 
Cov{Y, Y} = + alrRfj + a 2 a <f> a <f>' a + a\R A + a 2 I. 

Theorem 4. Given observations Y = {Y(ti),Y(t2), ■ ■ ■ ,Y(tj)}', the posterior mean ofU(t) with 
nested Gaussian process prior is denoted as U a 2 a 2(t) = E{U(t) | Y, a 2 , a^, a 2 }. We have 

lim lim U a2 2(t) = U(t), 

cr^ — > oo a g — > oo 

where U (t) is the nested smoothing spline. 



3 Posterior Computation 

To complete a Bayesian specification, we choose priors for the initial values, covariance parameters 
in the nGP and residual variance. In particular, we let n ~ N m (0,cr^/), a ~ N m (0, a^J), a 2 ~ 
invGamma(a, b), afj ~ invGamma(a, b) and afj ~ invGamma(a, b), where invGamma(a, b) denotes the 
inverse gamma distribution with shape parameter a and scale parameter b. In the applications 
shown below, the data are rescaled so that the absolute value of the maximum observation is less 
than 100. We choose diffuse but proper priors by letting a" 1 = a" 1 = a = b = 0.01 as a default 
to allow the data to inform strongly, and have observed good performance in a variety of settings 
for this choice. In practice, we have found the posterior distributions for these hyperparameters to 
be substantially more concentrated than the prior in applications we have considered, suggesting 
substantial Bayesian learning. 

With this prior specification, we propose an MCMC algorithm for posterior computation. This 
algorithm consists of two iterative steps: (1) Given the a 2 , a^, a\ and Y, draw posterior samples 
of n, U = {Ufa), U(t 2 ), U(tj)}', a and A = {Afc), A(t 2 ), . . . , A(tj)}'; (2) Given the /x, U, a, 
A and Y, draw posterior samples of a 2 , afj and o\. 

In the first step, it would seem natural to draw U and A from their multivariate normal con- 
ditional posterior distributions. However, this is extremely expensive computationally in high di- 
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mensions involving 0(J 3 ) computations in inverting J x J covariance matrices, which do not have 
any sparsity structure that can be exploited. To reduce this computatio nal bottleneck in GP mod 



els, there is a rich literature relying on low rank matrix approxim ations ( iSmola and Bartlett 



Lawrence et al 



2002 



Quinonero-Candela and Rasmussen 



2001 



20051 ). Of course, such low rank approx- 



imations introduce some associated approximation error, with the magnitude of this error unknown 
but potentially substantial in our motivating mass spectrometry applications, as it is not clear that 
typical approximations having sufficiently low rank to be computationally feasible can be accurate. 

To bypass the need for such approximations, we propose a different approach that does not 
require inverting J x J covariance matrices but instead exploits the Markovian property implied 
by SDEs (j2J) and ([3]). The Markovian property is represented by a stochastic difference equation, 
namely the state equation, as illustrated for the case when m = 2 and n — 1 in Proposition [TJ which 
is easily extended to cases with higher order of m and n. 

Proposition 1. When m = 2 and n = 1, nested Gaussian process U(t) along with its first order 
derivative -D 1 [/(t) and A(t) follow the state equation: 



j+1 - GjOj + ujj. 



where j+1 = {U(t j+1 ), D l U{t j+1 ), A(t j+1 )}' , Wj ~ N 3 (0, W-), G j 



( 



1 5 j 
1 




5 1 \ 

2 



5 



j 



and Wj 



( ' 



3 a U 



20 a A 2 a U + 8 °A e a A 



1 ) 



\ 



'*„2 



with 5n 



fa A fa A 8jO- A j 

The proof is in Appe ndix Rl The state equation combined with the ob s ervat ion equation (JTJ) forms 



a state space model (IWest and Harrison 



19971 : 



Durbin and Koopman 



200l[). for which the latent 



state s dj's can be efficiently sampled by a simulation smoother algorithm ( IDurbin and Koopmanl . 
20021 ) with O(J) computation complexity. 

Given the /x, U, a and A, posterior samples of of can be obtained by drawing from the inverse- 
gamma conditional posterior while afj and a A can be updated in Metropolis-Hastings (MH) steps. 
We have found that typical MH random walk steps tend to be sticky and it is preferable to use MH 
independence chain proposals in which one samples candidates for a\j and a\ from approximations 
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to their conditional posteriors that are easy to sample from. To accomplish this, we rely on the 
following proposition. 

Proposition 2. When 5j is sufficient small, the state equation in Proposition^ can be approximated 
by 

6 j+1 = Gj6j + Hjujj, 



' 1 6 j 0^ 



where u>j ~ N 2 I 0, Wj , G 3 



1 8i 



\ 



1 
1 



and Wj 



afjbj 







*A S 3 



1 J 

The above approximate st ate equation is derived by applying the Euler approximation (chapter 9, 



Kloeden and Platen . 



19921 ). essentially a first-order Taylor approximation, to the SDEs (jzj) and ([3]). 
Given the O/s, the afj and a\ in the above approximate state equation can be easily sampled as a 
Bayesian linear regression model with the given coefficients. 

Finally, we outline the proposed MCMC algorithm as follows: 

(1) . For the state space model with the observation equation (CEJ) and the state equation in Propo- 
sition [1], update the latent states /x, U, ol and A by using the simulation smoother. 

(2) . Sample of from the posterior distribution invGamma (a + \ J, b + ~ J2j=i {Y(tj) ~ U(tj)} 2 ^j. 

(3a). Given of, afj- and a A , we sample the latent states /x*, U*, ex* and A* for the approximate 
state space model with the observation equation flTJ and the approximate state equation specified 
in Proposition [2J 

(3b). Given fi*, U* , at* and A*, the proposal afj* and a\* is drawn from the posterior distributions 



invGamma ( a + \ J, b + \ Y^j=o 



1 v^J-l {DU^tj+^-DU'^-A'^SjY 



and 



invGamma ( a + \ J, 6 + | ^ i=0 



1 {A-fo+O-A'fo)}' 



respectively. 



(3c). The proposal ofr* and cr|* will be accepted with the probability 

j-i / Ni3 - | o, w*) h,2 (h,(o* +1 - g,o*) \ o, w 3 

~ 1 

=o /n, 3 - Gj&j | 0, W,) / N)2 [Hj(0* +1 - G,-0*) | 0, W 3 



where f^ t k(X | 0, S) denotes the probability density function of the k- dimensional normal random 
vector with mean and covariance matrix X; 0j, Wj and Wj are specified in Proposition [1] and 
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[2j Similar notions hold for 0*, W* and W" - with /z, U, a A, cr^ and a\ replaced by /x*, U*, a* 
A*, afj* and a\* correspondingly. 



4 Simulations 

We conducted a simulation study to assess the performance of the proposed method, Bayesian 
nonparametric regression via an nGP pr ior (BNR-nGP ), and compared it to several alternative 



methods: cubic smo othing spline (SS 



threshold (Wavelet 1, 



Wahba, 



Donoho and Johnstond . 



estimate of ris k for threshold choice (W avelet2, 
splines (HAS, 



19901 ). wavelet method with the soft minimax 



19941). wavelet method with the soft Stein's unbiased 



Donoho and Johnstone 



Luo and Wahba . 



19951 ) and hybrid adaptive 



19971 ) . For BNR-nGP, we take the posterior mean as the estimate, 
which is based on the draws from the proposed MCMC algorithm with 1,500 iterations, discarding 
the fir st 500 as the burn-in stage and sav ing remaining ones. The other methods are implemented 



in R (|R Development Core Team 



meth ods (wmtsa, 
20111). 



2011), a 



Const ant ine and Per rival 



ong w ith the corresponding R packages f or Wavelet 



20101 ) and hybrid adaptive splines (bsml, 



Wu et al 



Our first simulation study focuses on four functions adapted from lDonoho and Johnstond (119941 ) 
with different types of locally- varying smoothness. The functions are plotted in Figure (2J for which 
the smoothness levels vary, for example, abruptly in panel (a) or gradually in panel (d). For each 
function, equally-spaced observations are obtained with Gaussian noise, for which the signal-to-noise 



SD(U) 
OS 



= 7. We use the mean squared error (MSE) j ^2j=i{U{tj) — U (tj)} 2 to compare the 



performance of different methods based on 100 replicates. The simulation results are summarized 
in Tabled] Among all methods, SS performs worst, which is not surprising since it can not adapt to 
the locally- varying smoothness. Among the remaining methods, BNR-nGP performs well in general 
for all cases with either the smallest or the second smallest average MSE across 100 replicates. In 
contrast, Wavelet2 and HAS may perform better for a given function, but their performances are 
obviously inferior for another function (e.g. Heavisine for Wavelet2 and Doppler for HAS). This 
suggests the nGP prior is able to adapt to a wide variety of locally-varying smoothness profiles. 
We further compare the proposed method and the alternative methods for anal yzing mass spec 



trome try data. The 100 datasets are generated by the 'virtual mass spectrometer' ( jCoombes et al 



2005al ). which considers the physical principles of the instrument. One set of these simulated data is 
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plotted in Figure [3] w ith o e = 66. The simulated data have been shown to accurately mimic real data 



( Morris et al. 



20051 ) and are available at http://bioinformatics.mdanderson.org/Supplements/Datasets/ 



Since the analysis of all observations (J=20,695) of a given dataset is computational infeasible for 
HAS, we focus on the analysis of the observations within two regions with 5 < km/z < 8 (region 1 
with J=2,524) and 20 < km/z < 25 (region 2 with J=2,235) respectively. Those two regions rep- 
resent the unique feature of mass spectrometry data. More specifically, with smaller km/z values 
the peaks are much taller and sharper than the peaks in the region with larger km/z values. The 
results in Table [T] indicate that the BNR-nGP performs better than the other smoothness adap- 
tive methods for both regions in terms of smaller average MSE and narrower interquartile range of 
MSE. Although the smoothing spline seems to work well with smaller average MSE in region 2, the 
peaks are clearly over-smoothed, leading to large MSEs at these important locations. In contrast, 
BNR-nGP had excellent performance relative to the competitors across locations. 

[Figure 2 about here.] 
[Figure 3 about here.] 
[Table 1 about here.] 



5 Applications 



We apply the proposed method to protein mass spectrometry (MS) data. Prot ein MS plays an im 



portant role in proteomics for identifying disease-rela t ed proteins in the samples (ICottrell and London 



1999 



Tibshirani et al. 



2004 



Domon and Aebersold 



2006 



Morris et al. 



20081 ). For example, Panel 



(a) of Figure [T] plots 11, 186 intensities in a pooled sample of nipple asp irate fluid from health y 
breasts and breasts with cancer versus the mass to charge ratio m/z of ions (ICoombes et all l2005bl ). 
Analysis of protein MS data involves several steps, including spectra alignment, signal extraction, 
baseline subtraction, normalization and peak detection. As an illustration of our method, we focus 
on the second step, i.e., estimate the intensity function adjusted for measurement errors. Peaks in 
the intensity function may correspond to proteins that differ in the expression levels between cancer 
and control patients. 

We fit the Bayes nonparametric regression with nGP prior and ran the MCMC algorithm for 
11,000 iterations with the first 1000 iterations discarded as burn-in and every 10th draw retained 
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for analysis. The trace plots and autocorrelation plots suggested the algorithm converged fast and 
mixed well. Panel (b) of Figure [TJ plots the posterior mean of U and its pointwise 95% credible 
interval. Note that the posterior mean of U is adapted to the various smoothness at different 
regions, which is more apparently illustrated by the Panel (c) of Figure [TJ Panel (d) of Figure [TJ 
demonstrates the posterior mean and 95% credible interval of rate of intensity change DU, which 
suggests a peak around 4 km/z. 

6 Discussion 

We have proposed a novel nested Gaussian process prior, which is designed for flexible nonparametric 
locally adaptive smoothing while facilitating efficient computation even in large data sets. Most 
approaches for Bayesian locally adaptive smoothing, such as free knot splines and kernel regression 
with varying bandwidths, encounter substantial problems with scalability. Even isotropic Gaussian 
processes, which provide a widely used and studied prior for nonparametric regression, face well 
known issues in large data sets, with standard approaches for speeding up computation relying on 
low rank approximations. It is typically not possible to assess the accuracy of such approximations 
and whether a low rank assumption is warranted for a particular data set. However, when the 
function of interest is not smooth but can have very many local bumps and features, high resolution 
data may be intrinsically needed to obtain an accurate estimate of local features of the function, with 
low rank approximations having poor accuracy. This seems to be the case in mass spectroscopy 
applications, such as the motivating proteomics example we considered in Section [5J We have 
simultaneously addressed two fundamental limitations of typical isotropic Gaussian process priors 
for nonparametric Bayes regression: (i) the lack of spatially- varying smoothness; and (ii) the lack 
of scalability to large sample sizes. In addition, this was accomplished in a single coherent Bayesian 
probability model that fully accounts for uncertainty in the function without relying on multistage 
estimation. 

Although we have provided an initial study of some basic theoretical properties, the fundamental 
motivation in this paper is to obtain a practically useful method. We hope that this initial work 
stimulates additional research along several interesting lines. The first relates to generalizing the 
models and computational algorithms to multivariate regression surfaces. Seemingly this will be 
straightforward to accomplish using additive models and tensor product specifications. The second 
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is to allow for the incorporation of prior knowledge regarding the shapes of the functions; in some 
applications, there is information available in the form of differential equations or even a rough 
knowledge of the types of curves one anticipates, which could ideally be incorporated into an nGP 
prior. Finally, there are several interesting theoretical directions, such as showing rates of posterior 
contraction for true functions belonging to a spatially-varying smoothness class. 



A Appendix: Proofs of Theoretical Results 
A.l Proof of Lemma [I] 

We specify U(t) = U(t) + A(t) and A(t) = D m A(t). By SDEs © and ©, 

D m U(t) = auWu(t), (5) 
D m+n A(t) = a A W A (t). (6) 

By applying stochastic integration to SDEs (J5J) and (jBJ), it can be shown that 

m— 1 „ 

U(t) = U (t) + Ux(t) = y2fr<f>i(t) + °u / G m (t : u)Wu(u)du, 

n—1 „ 

A(t) = A (t) + A\(t) = J2a i( f) m+i (t) +a 2 A G m+n (t,u)W A (u)du 

i=o Jr 

given the initial values /j, and a. Since Uo(t), Ui(t), A (t) and Ai(t) are the linear combination 
of Gaussian random variables at every t, they are Gaussian processes defined over t, whose mean 
functions and covariance functions can be easily derived as required. In addition, Uo(t), Ui(t), Ao(t) 
and Ai(t) are mutually independent due to the mutually independent assumption of fx, a, Wu(') 
and in the definition of nGP. 



A. 2 Proof of Theorem [T] 



We aim to characterize the RKHS of U with the reproducin g kernel K,n(s,t). The support o 



U, a m ean-zero Gaussian random element, is the closure of H/Cu (IVan der Vaart and Van Zanten 



2008b. Lemma 5.1). 



By Loeve's Theorem (IBerlinet and Thomas- A gnanl . 120041 . Theorem 35), the RKHSs generated 



by the processes Uo(t), U\(t), A (t) and A\(t) with covariance functions JCfj o (s, t), JCfj^s, t), JC^ Q (s, t) 

15 



and K. Ai (s,t) (given in Lemma ED) are congruent to RKHSs , ^ ^g- > an< ^ j res P ec 



tively. Based on Theorem 5 of 



Berlinet and Thomas-Agnanl (120041 ). we conclude that K.jj(s,t) 



JCfj o (s,t) + JCjy i (s,t) + K.^ o (s,t) + )C Al (s,t) is the reproducing kernel of the RKHS 



u 



n 



{U(t) : U(t) = U (t) + U 1 (t) + A (t) + A 1 (t), 

U {t) e H^U^t) e H^Mt) e H KAo ,Mt) e Hk Ai } 



A.3 Proof of Theorem H 

Similar to the proof of Lemma [H we specify £7 = E/q + E/i + Aq + A±, which is a mean zero Gaussian 
process with continuous and differentiable covariance function fCu(s,t) = Kq ■ (s,t) + K.(j (s,t) + 
JC Ao {s,t)+lC Al {s,t). 

y th e sufficient conditions of the strong consistency theorem (Theorem 1, 



We aims to veri 



Choi and Schervish 



20071 ) for nonparametric regression: (I) prior positivity of neighborhoods and 
(II) existence of uniformly exponentially consistent tests and sieves 0j with rLj(Oj) < C\ exp(— C 2 J) 
for some positive constants C\ and C 2 - 

Given U is a Gaussian pro cess with continuous sa mple path and continuous covariance function, 



it follows from Theorem 4 of 



Ghosal and Rov! (120061 ) that U V (\\U - E/o|U < 5) > for any 5 > 0. 



In addition, for every 5 > 0, 11^ (l^^ ~~ 1| < °) > under Assumption [3j Hence, we can define a 
neighborhood B$ = |([/, er £ ) : \\U — t/o||oo < ^ 1^ — 1| < such that ^-(u,a e )(Bs) > satisfying 
the condition (I). 



From Theorem 2 of 



Choi and Schervish! (120071 ). we can show that for a sequence of Mj, there 



exist uniformly exponentially consistent tests for the sieves 0j = {U : ||C/||oo < Mj, \\DU\loo < Mj} 
under the infill design Assumption [TJ What remains is to verify the exponentially small probability of 
ej = fU : WUWm > MA and Q C T , = \U : ||£>£/||oc > Mj}. Using Borell's inequality (Proposition 



19961 ). we have 



A. 2. 7, iVan der Vaart and Wellnerl . 

U u (\\U\\ 0O >Mj)<C 1 exp 



C 3 Mj 



a' 
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for some positive constants C\ and C3, and 



a 2 := sup \ E(U + U 1 + A + A 1 



sup < EUq + EUf + EAq + EA\ 

t€{0,tu] L 

m -! _2j.2m-l 
^2 ±2/+ \ , a U L U 



i=0 
n-1 



(m- l)!(m-l)!(2m- 1) 



2 + 2m+2n-l 



i=0 



(m + n — l)!(m + n — l)!(2m + In — 1) 



By applying the Borel-Cantelli theorem, we have 

n f/ (||t/|| 00 >M J )<C 1 exp(-C , 2 J), 

almost surely under the exponential tail Assumption [2j By the similar arguments, we can show 
that ILy (||ZH7||oo > Mj) < C\ exp(-C 2 J). 

Hence, the conditions (I) and (II) hold, which leads to the strong consistency for Bayesian 
nonparametric regression with nGP prior. 



A. 4 Proof of Corollary ffl 

Note that U{t) = U (t) + UAt) = Y^ 1 UiA(t) + J T G m (t,u)Wu(u)du is the prior for the 
polynomial smoothing spline (jWahbal . Il990l . Section 1.5). By the similar arguments in Theorem [H 



we can show that the support of U is the closure of RKHS Hjc & = Hfc^ © • 



Thus, K,u{s,t) — K,jj(s,t) = /Cjfg, t) 



C WiCu by Corollary 4 of lAronszajnl (I1950I ). 



Wk,- a nonnegative kernel, which implies that 



A. 5 Proof of Theorem [3] 

Let U(t) = U(t) +A(t) and A(t) = D m A(t). The nested penalized sum-of-square (jl]) can be written 
as: 



1 2 /* 2 /* 2 

nPSS(t) = -J2{Y{t j )-U{t j )-A{t j )} +Xuj [D m U(t)j dt + \ A J {/) m+n i(t)} dt, (7) 



where U(t) is the m-order polynomial smoothing spline and A(t) is the (m + n)-order polynomial 
smoothing spline. 
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By the classical RKHS theory of the polynomial smoothing spline ( jWahbal . Il990l . Section 1.2), 
there exists a unique decomposition of U(t): 



U{t) = U {t) + U 1 (t) 



m— 1 « 

J2vMt)+ / G m (t,u)D m U(t)du 



with U (t) G Hn 0o and U^t) G H^. U n ^ = {/(*) : D m f(t) = 0,t G T} nad 

U n -^ = {/(*) : absolutely continuous for i = 0, 1, • ■ • , m - 1, D m f(t) G C 2 {T)} are the RKHSs 

with reproducing kernel 11$ (s,t) and IZfj^s.t) respectively, where </>i(t), G m (t,u), 11$ (s,t) and 
TZjj^s.t) are defined in Theorem [1] with C-2(T) = {f(t) : f T f 2 (t)dt < oo} the space of squared 
integrable functions defined on index set T. 

Given T = {tj : j = 1,2, ••• , J}, the tA(t) G ^7j & can be uniquely written as t/i(i) = 
X^=i Vjl^Uiitji + 7 lu 1 (^)f where 77^ 1 (-) G orthogonal to TZ^tj, ■) with inner product 

(ftfcfe, (O)**^ = J T D m n 0i (t v u)D m Vl)i (u)du = for j = 1,2, • ■ ■ , J. 

As a result, 



r 



{/) m f7(t) 



dt 



m—l 



T 

J J 



i=0 



v'R & u + {r) &1 (-),r) &i (-)) H 



By similar arguments, 



i(t) = i (t) + i 1 (t) 

n-1 J 
3=1 



i=0 



and 



'T 



where A (t) G 7fo. and Ai(t) G ft*, with 7^. = {fit) : D m+n f(t) = 0, t G T} and 

ft- = {f(t) : D'/(i) absolutely continuous for i = 0, 1, • • • , m + n - 1, D m+n f(t) G £ 2 (T)} the 
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RKHSs with reproducing kernel lZ Ao (s,t) and TZ Ai (s,t) respectively; r] Ai (-) G Hn A is orthogo- 
nal to TZ Ai (tj, ■) with inner product (TZ Ai (tj, ■) , Al {-)) u n - — f T D rn lZ Ai (tj,u)D m ri Ai {u)du = for 

A i 

J = 1, 2, • - - ,J. 

Note that Tfofo) = (T^fe, •), ^ = and r/ A (i,) = (7^ fo, ■), ^ (■))«*, = due 

"1 A i 

to the reproducing property of TZfj (tj, •) and V, A (tj, ■). It then follows from expression ([7]) that 

nPSS(t) =1 (y - - R u - <\> a a - R A (3)' (Y - ^ - R v - cf> a a - R A (5) 

+ \uv' R D v + \ A (3' R A (3 + {r]fj{-),rifj(-)) Hn + {r] A {-),r] Ai {-)) Un , 

which is minimized when (% (•), r] D {-)) H = (r) A (-), ?7Ai (■))«*, = °- Thus ' %i(-) = ^(0 = 
and we obtain the forms of U(t) and nPSS(t) as required. 

A.6 Proof of Corollary H 

We first take partial derivatives of nested penalized sum-of-squares nPSS(t) in Theorem [3] with 
respective to 1/, a and /3 and set them to zeros: 

0; (<^/i + + Q « + - r) = 0, (8) 

Rfj (0^ + M & i/ + Q a + - V) = 0, (9) 
<j>' a {4* fit 1 + ^ + a a + - Y) = 0, (10) 

^(^iV + ^ + M^-y)^, (11) 

where Mfj = Rjj + JXjjI and Af ^ = R A + JA^J. It follows from equations ([2]) and ffTTj) that 





d n 





nPSS(t) 




d v 





nPSS(t) 




9 OL 





nPSS(t) 



/3= — v. 



Substituting them into equations (J8J) and (fit)]) with some algebra leads to 
from which we obtain 

19 



It is then straightforward to show 



as desired. 



A. 7 Proof of Lemma H 

Let U(t) = U(t) + A(t) and A(t) = D m A(t). From SDEs © and ©, 

D m U(t) = auW v (t), 
D m+n A(t) = a A W A (t). 

Thus, given the initial value /x, it can be sho wn that U(t) = ^™Lq Hi4>i{t) + crfj L. G m (t, u)Wu{u)d 



a (m — l)-fold integrated Wiener process (jSheppl . Il966l ). Similarly, A(t) = Yl^Z® a i4>m+i{t) + 
a\ j T G m+n (t, u)Wa(u)cLu, a (m + n — l)-fold integrated Wiener process. 

It is obvious that E {U(t)} = and E {Y} = 0. Given the mutually independent assumption of 

a, Wu(-) and W A {-), 

Coy {U(t),Y(t,)} = Cov{f/(t), U(tj)} 

= Cov Uj(t),U(tj)\ + Cov{i(t),i(t i ) 

= E{u(t)U(t s )} + E{A(t)A(t j )} 

m—l 
n-1 



i=0 



and 



Cov {F(t,), Y(t f )} = Cov {U{tj), U{t r )} + a, 



m—l 



i=0 
n-l 

^ 4>m+i(tj)(f>m+i(tj>) + (T^TZ^tj ,t f ) + <T 



i=0 



for j = 1, 2, • • • , J and f = 1, 2, • • • , J. The lemma holds. 
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A. 8 Proof of Theorem [4] 



By Lemma [2] and the results on conditional multivariate normal distribution (jSearld . Il982l ) 

E{U(t) | Y, a% a 2 a , a 2 £ } = Cov {U(t), Y} Cov" 1 {Y, Y} Y 

= [p^(t)^ + R' (t) + Pa<t>' a {t)<t>' a + PAR' A {t)] x 

[ Pv ( t>»<t>'n + Pa<f>a<f>' a + PaRa + R(J + JX ul] 1 Y 

= cp'^t) (p^E^) Y + i^E" 1 ^ 

(pafaZ;^) Y + p A R' A (t)^ p Y 



where p M = <tJ/o£, p Q = a^/afj, p A = a A /afj, J\ v = a^/afj and E 



Pp<f>p<f>'p 



S a with 



>Sp a — Pa'fia'fi'a + S = Pa^a'Pa + PaRa + -^(7 + J^ul- We are going to evaluate the limits of 

P»<p'p^~p!p a l P*<t>'cPpLa aIld S pi Q Whel1 ~> + °° aIld P« -> + °°- 



P PvPui rava^p^pa 

It can be verified that 



= ~ s- p l<f>, {cp'pS^y 1 { i + p; 1 <^s-\ (12) 

It follows that S" 1 = lim S" 1 = - (^flT 1 ^) -1 ^ < S" 1 = S" 1 - ST^E^XflT 1 

and ^"Vp = £p|« and S'Vp = S' 1 ^. 



As a result, 



5]"^= lim lim E" 1 

p M ->+oo p Q ->-+oo 



By expression (IT2"|) . 

P^'^~pl Pa = Pp 



Pa ' 



It follows that lim lim p a <b'Y, nn =E , d>,,\ n S . By similar arguments, lim lim p a d>' £„ „ 

^alp^alp 5, • 
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Hence, S" 1 "K = v>, p A T,^ Y = (3, lim Urn p,^ 1 Y = p and Hm Urn P^'J^ Y 
ex. The theorem holds. 



A. 9 Proof of Proposition [T] 

When m = 2 and n = 1, the SDEs (j2]) and (j3J) can be written as, 

D 1 0(t) = CO(t)+DW{t), 



where 6(t) = < 
As a result 



f/(t) 
D x U{t) 
A{t) 



( 



y c 



1 
1 




/ 

, c = 



\ 




\ a a J 



and W(t) 



W v {t) 
W A (t) 



j+1 = exp(C5 i )0 i + / exp{C(5j -u)}DW(tj + u)du 

Jo 

= GjOj + ujj, 



where = exp(C^) = / + + S]CC/2 



1 °? 2 

1 ^ 
1 



and ~ N 3 (0, Wj) with 



= / exp{C(5 j - «)}£>£>' exp{C'(5 3 - - 

JO 

/ <5 3 9 8 5 9 <5 2 9 <5 4 9 <5 3 9 \ 



6 a A 



2 a A 



8j°A ) 



as required. 
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Figure 1: (a) Plot of protein mass spectrometry data: observed intensities versus mass to charge 
ratio m/z; (b) Posterior mean ( — ) and 95% credible interval of U (red shades); (c) Posterior mean 
and 95% credible interval of U for a local region; (d) Posterior mean and 95% credible interval of 
rate of intensity changes DU. 
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Table 1: Average MSE and the interquartile range of MSE (in parentheses) for Bayesian nonpara- 
metric regression with nGP prior (BNR-nGP), smoothing spline (SS), wavelet method with the soft 
minimax threshold (Wavelet 1), wavelet method with the soft Stein's unbiased estimate of risk for 
threshold choice (Wavelet2) and Hybrid adaptive spline (HAS). 



Example BNR-nGP SS Wavelet 1 Wavelet2 HAS 

Blocks 0.950(0.166) 3.018(0.248) 2.750(0.748) 1.237(0.341) 0.539(0.113) 

Bumps 1.014(0.185) 26.185(0.787) 3.433(0.938) 1.195(0.282) 0.904(0.258) 

Heavisine 0.320(0.058) 0.337(0.087) 0.702(0.230) 1.620(0.460) 0.818(0.122) 

Doppler 0.989(0.183) 3.403(0.361) 1.517(0.402) 0.695(0.179) 3.700(0.534) 

MS Region l(xl0~ 3 ) 1.498(0.266) 2.293(0.513) 2.367(0.616) 6.048(3.441) 72.565(39.596) 

MS Region 2(xl0~ 3 ) 0.840(0.375) 0.798(0.490) 0.948(0.587) 1.885(0.493) 7.958(5.559) 
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